🌍 Web sites we are visiting

Check the Rmd file for code to do polite introductions to these web sites.

⚽ Exercise 11A

Motivated by Ryo’s blog post, we are going to scrape soccer records from Wikipedia.

  1. We’ll first extract records on Asian Cup men’s soccer, which can be done with the code below.

  2. How can we know that the data is organised as a table in the html? Go to the web site in your Google Chrome browser and use SelectorGadget to see the elements in the html.

  3. How did we know to extract the data from the 6th table? Try changing the 6 in the code html_table(tbl[[6]], fill=TRUE) and compare the result with the tables shown on the web page.

  4. Let’s use the data to make a chart. Examine the goals for and against for each team, and add a guide line where goals for and against are equal. Also make the plot interactive using plotly, so you can hover over team names. Which teams have scored more goals than have been scored against them, over all time? How does this compare with the champions list?
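The steps above might be sketched as follows. This is a sketch only: the Wikipedia URL, the table position (6), and the column names Team, GF and GA are assumptions, and the live page layout may differ from the code in the Rmd file.

```r
library(rvest)
library(ggplot2)
library(plotly)

# Read the page and collect all of its tables
# (URL is an assumption; the records table was 6th at the time of writing)
url <- "https://en.wikipedia.org/wiki/AFC_Asian_Cup"
tbl <- html_nodes(read_html(url), "table")
records <- html_table(tbl[[6]], fill = TRUE)

# Goals for vs against, with a dashed guide where they are equal
# (Team, GF and GA are hypothetical column names -- rename to match the table)
p <- ggplot(records, aes(x = GF, y = GA, label = Team)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed")
ggplotly(p)  # interactive: hover to see team names
```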

  1. How did we know that these were tables in the web page? Here’s where it’s important to learn how to use the developer tools in your web browser. Open the page in Google Chrome, scroll down to the table of interest, and right-click to choose Inspect:

This will bring up the page source. Can you see the <table> tag, and the same information that is visible in the web page itself?

🏏 Exercise 11B

  1. Using the fetch_cricinfo function from the cricketdata package, extract all the records for Australian women’s T20 matches, and answer the following questions:
# remotes::install_github("ropenscilabs/cricketdata")
library(cricketdata)
auswt20 <- fetch_cricinfo("T20", "Women", country="Aust")
## # A tibble: 1 x 1
##       n
##   <int>
## 1    53
## # A tibble: 1 x 2
##   Start   End
##   <int> <int>
## 1  2005  2020
## # A tibble: 53 x 2
##    Player       Matches
##    <chr>          <int>
##  1 EA Perry         120
##  2 AJ Healy         112
##  3 MM Lanning       104
##  4 AJ Blackwell      95
##  5 JL Jonassen       79
##  6 RL Haynes         67
##  7 M Schutt          67
##  8 JE Duffin         64
##  9 EJ Villani        62
## 10 EA Osborne        59
## # … with 43 more rows
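The three printouts above could be produced with summaries along these lines — a sketch, assuming the returned tibble has one row per player with Player, Start, End and Matches columns:

```r
library(dplyr)

# How many players appear in the data?
auswt20 %>% summarise(n = n())

# What span of years does the data cover?
auswt20 %>% summarise(Start = min(Start), End = max(End))

# Matches per player, most-capped first
auswt20 %>%
  select(Player, Matches) %>%
  arrange(desc(Matches))
```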

  1. Take a look at the code for the function fetch_cricinfo. The work is mostly done by a hidden function, cricketdata:::fetch_cricket_data. A key part of that function is these lines:
url <- paste0("http://stats.espncricinfo.com/ci/engine/stats/index.html?class=",
              matchclass,
              ifelse(is.null(country), "", paste0(";team=", team)),
              ";page=", format(page, scientific = FALSE),
              ";template=results;type=", activity, view_text,
              ";size=200;wrappertype=print")

Try setting these values and creating the URL manually. When you have it right, you will have found the page that the data is extracted from! (Alternatively, if this is too frustrating, try working with the Statsguru query tool to find the table of interest.)

Using your URL, use the read_html and html_table functions to reproduce what the cricketdata package did for you.

Note that the package returned 53 records because there were two pages of data. The manual code only scraped one of the two pages.
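As a sketch, the URL pieces might be filled in like this. The class code 10 (women's T20 internationals) and team code 289 (Australia) are assumptions — check them against the Statsguru URL in your own browser. Looping over both pages recovers all the records at once:

```r
library(rvest)
library(purrr)

# Build a URL for each of the two pages of results
# (class and team codes are assumptions -- verify them in your browser)
urls <- paste0(
  "http://stats.espncricinfo.com/ci/engine/stats/index.html?class=10",
  ";team=289;page=", 1:2,
  ";template=results;type=batting;size=200;wrappertype=print"
)

# Scrape each page and bind the rows together
manual <- map_dfr(urls, function(u) {
  read_html(u) %>%
    # Statsguru data tables carry the engineTable class; if the first match
    # is not the data table, switch to html_nodes() and pick the right one
    html_node("table.engineTable") %>%
    html_table(fill = TRUE)
})
```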

🤔 Use the inspect tool on the web page, to see that it is indeed an html table.

🎾 Exercise 11C

Sometimes pages are dynamically created, which means that it isn’t possible to directly extract the data. The women’s tennis records web site uses dynamic web pages.

  1. Have a look at what happens when you directly try to read the women’s stats web page.
url <- "https://www.wtatennis.com/stats"
wta_html <- read_html(url)
# Returns nothing useful: the table is built by javascript after the page loads
wta_rankings <- html_node(wta_html, "table")
  1. Now save a rendered copy of the page from your browser, and try reading the saved file instead.
# Save web page source locally, because it contains javascript content
wta_html <- read_html("wta_rankings2.htm")
wta_rankings <- html_node(wta_html, "table") %>% html_table(fill=TRUE) 
# There is only one table in page so use html_node rather than html_nodes
wta_rankings <- wta_rankings %>% 
  janitor::remove_empty() %>% 
  as_tibble()

This time you’ve got data. How can you tell that the page is dynamic? Inspect the source and you will find <script> tags. This often indicates that javascript is used to create the table.

  1. Use your wrangling skills to clean up the data, making the numeric variables numeric.

  2. Make a plot with the data you’ve just extracted. Your choice.
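For the cleanup step, readr::type_convert is a quick way to re-guess column types after scraping, since html_table often returns everything as character. A sketch — the plot assumes hypothetical player and aces columns, so substitute whatever the table actually contains:

```r
library(readr)
library(ggplot2)

# Let readr re-guess the column types; columns with thousands separators
# such as "1,234" may need readr::parse_number() instead
wta_clean <- type_convert(wta_rankings)

# One possible plot: aces by player
# (player and aces are hypothetical column names)
ggplot(wta_clean, aes(x = reorder(player, aces), y = aces)) +
  geom_col() +
  coord_flip()
```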